Note: This exercise is adapted from the original here. As of September 2020, installing pandas_profiling from conda may give you an old version (1.41), because some conda channels lag behind the latest release on PyPI (2.9.0 as of September 2020). The exact environment and library versions used to run this exercise are listed in the Pipfile of this example (see pipenv for more details) here.

Pandas Profiling: NASA Meteorites example

Source of data: https://data.nasa.gov/Space-Science/Meteorite-Landings/gh4g-9sfh

The autoreload instruction reloads modules automatically before code execution, which is helpful for the update below.

%load_ext autoreload
%autoreload 2

Make sure that we have the latest version of pandas-profiling.

# # uncomment and run below if you need to pip install the pandas-profiling library
# import sys
# !{sys.executable} -m pip install -U pandas-profiling==2.9.0
# !jupyter nbextension enable --py widgetsnbextension

You might want to restart the kernel now.
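
After a restart, it can help to confirm which version the kernel actually sees. The following is a minimal sketch that uses only the standard library (Python 3.8+); the helper name `installed_version` is illustrative and not part of pandas-profiling.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version of a distribution, or None if it is not installed."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


# This exercise was run with pandas-profiling 2.9.0
print(installed_version("pandas-profiling"))
```

If this prints None or an unexpectedly old version, the install step above probably ran against a different environment than the notebook kernel.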

Import libraries

To install with conda instead (this may give an older version, as noted at the top):

conda install -c anaconda pandas-profiling

from pathlib import Path

import requests
import numpy as np
import pandas as pd

import pandas_profiling
from pandas_profiling.utils.cache import cache_file

Load and prepare example dataset

We add some fake variables for illustrating pandas-profiling capabilities

file_name = cache_file(
    "meteorites.csv",
    "https://data.nasa.gov/api/views/gh4g-9sfh/rows.csv?accessType=DOWNLOAD",
    # Alternative endpoint:
    # "https://data.nasa.gov/resource/gh4g-9sfh.csv",
)
print(file_name)
df = pd.read_csv(file_name)

# Note: pandas nanosecond timestamps cannot represent dates before 1677-09-21,
# so errors='coerce' turns earlier (or unparseable) years into NaT and they are
# ignored in this analysis
df['year'] = pd.to_datetime(df['year'], errors='coerce')

# Example: Constant variable
df['source'] = "NASA"

# Example: Boolean variable
df['boolean'] = np.random.choice([True, False], df.shape[0])

# Example: Mixed with base types
df['mixed'] = np.random.choice([1, "A"], df.shape[0])

# Example: Highly correlated variables
df['reclat_city'] = df['reclat'] + np.random.normal(scale=5,size=(len(df)))

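# Example: Duplicate observations
# Copy the first 10 rows so that renaming them does not modify the original rows
duplicates_to_add = df.iloc[0:10].copy()
duplicates_to_add['name'] = duplicates_to_add['name'] + " copy"

# DataFrame.append was removed in pandas 2.0; pd.concat works in old and new versions
df = pd.concat([df, duplicates_to_add], ignore_index=True)
df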

/Users/akiofukashima/miniforge3/envs/tf_m1/lib/python3.8/data/meteorites.csv
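
The effect of errors='coerce' on the year column can be seen on a small sample. This is a sketch on made-up values, not on the real file, and the timestamp format string is an assumption about how the CSV stores years.

```python
import pandas as pd

# Made-up sample values (assumption: the CSV stores years as full timestamps like this)
sample = pd.Series(["01/01/1880 12:00:00 AM", "not a date"])

# errors='coerce' returns NaT instead of raising for values that cannot be converted
parsed = pd.to_datetime(sample, format="%m/%d/%Y %I:%M:%S %p", errors="coerce")
n_coerced = int(parsed.isna().sum())

print(parsed)
print("Earliest nanosecond timestamp:", pd.Timestamp.min)
print("Values coerced to NaT:", n_coerced)
```

On the real DataFrame, df['year'].isna().sum() after the conversion gives the number of rows whose year was missing, unparseable, or out of range.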

Inline report without saving object

report = df.profile_report(sort='None', html={'style':{'full_width': True}}, progress_bar=False)
report

Save report to file

profile_report = df.profile_report(html={'style': {'full_width': True}})
profile_report.to_file("tmp/example.html")
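
If the tmp directory does not exist yet, writing the report there may fail. A minimal sketch that creates the parent directory first with pathlib; the helper name `prepare_output_path` is hypothetical and not part of pandas-profiling.

```python
from pathlib import Path


def prepare_output_path(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    output_path = Path(path)
    output_path.parent.mkdir(parents=True, exist_ok=True)
    return output_path


output_file = prepare_output_path("tmp/example.html")
# profile_report.to_file(output_file)  # same call as above, once the directory exists
```

This also puts the pathlib import from the top of the notebook to use.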

More analysis (Unicode) and printing an existing ProfileReport object inline

profile_report = df.profile_report(explorative=True, html={'style': {'full_width': True}})
profile_report

Notebook Widgets

profile_report.to_widgets()

IOPub message rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_msg_rate_limit`.

Current values:
NotebookApp.iopub_msg_rate_limit=1000.0 (msgs/sec)
NotebookApp.rate_limit_window=3.0 (secs)

	name	id	nametype	recclass	mass	fall	year	reclat	reclong	geolocation	source	boolean	mixed	reclat_city
0	Aachen copy	1	Valid	L5	21.0	Fell	1880-01-01	50.77500	6.08333	(50.775, 6.08333)	NASA	True	1	42.143885
1	Aarhus copy	2	Valid	H6	720.0	Fell	1951-01-01	56.18333	10.23333	(56.18333, 10.23333)	NASA	True	1	58.301088
2	Abee copy	6	Valid	EH4	107000.0	Fell	1952-01-01	54.21667	-113.00000	(54.21667, -113.0)	NASA	True	A	58.580998
3	Acapulco copy	10	Valid	Acapulcoite	1914.0	Fell	1976-01-01	16.88333	-99.90000	(16.88333, -99.9)	NASA	True	A	13.192585
4	Achiras copy	370	Valid	L6	780.0	Fell	1902-01-01	-33.16667	-64.95000	(-33.16667, -64.95)	NASA	True	A	-19.466973
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1005	Adhi Kot copy	379	Valid	EH4	4239.0	Fell	1919-01-01	32.10000	71.80000	(32.1, 71.8)	NASA	False	A	33.885754
1006	Adzhi-Bogdo (stone) copy	390	Valid	LL3-6	910.0	Fell	1949-01-01	44.83333	95.16667	(44.83333, 95.16667)	NASA	False	A	48.545131
1007	Agen copy	392	Valid	H5	30000.0	Fell	1814-01-01	44.21667	0.61667	(44.21667, 0.61667)	NASA	False	A	41.135277
1008	Aguada copy	398	Valid	L6	1620.0	Fell	1930-01-01	-31.60000	-65.23333	(-31.6, -65.23333)	NASA	True	1	-28.565801
1009	Aguila Blanca copy	417	Valid	L	1440.0	Fell	1920-01-01	-30.86667	-64.55000	(-30.86667, -64.55)	NASA	True	A	-28.675330